| Objective | Complete |
|---|---|
| Create Term-Document Matrix | |
| Explore the distribution of words in corpus | |
A Document-Term Matrix (DTM) is simply a matrix of unique word counts in each document.
The corpus vocabulary consists of all of the unique terms (i.e. the column names of the DTM) together with their total counts across all documents (i.e. the column sums).
A Term-Document Matrix is just the transpose of the Document-Term Matrix, with terms in rows and documents in columns.
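The relationships above can be sketched with a tiny hand-built example. The documents, terms, and counts here are toy values, not the real corpus:

```python
# A minimal sketch (toy data) showing a DTM, the corpus vocabulary
# as column sums, and the Term-Document Matrix as the transpose.
import pandas as pd

# Toy DTM: documents in rows, terms in columns.
DTM = pd.DataFrame(
    [[2, 0, 1],
     [0, 1, 1]],
    index=["doc1", "doc2"],
    columns=["cat", "dog", "run"],
)

# Vocabulary with total counts across all documents: the column sums.
vocab_counts = DTM.sum(axis=0)
print(vocab_counts)  # cat 2, dog 1, run 2

# The Term-Document Matrix is just the transpose.
TDM = DTM.T
print(TDM.shape)  # (3, 2): terms in rows, documents in columns
```

Note that no information is lost in the transpose; the two matrices are interchangeable views of the same counts.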
We will use CountVectorizer from the scikit-learn library's feature_extraction module for working with text. scikit-learn is used heavily for machine learning; you can find the complete documentation here.
CountVectorizer takes a list of character strings that represent the documents as the main argument, passed to its fit_transform() method:
.fit_transform(list_of_documents)
It returns a 2D array (i.e. a matrix) with documents in rows and terms in columns - the DTM.
# Transform the list of clean documents `df_clean_list` into a DTM.
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(df_clean_list)
print(X.toarray())  #<- to show output as a matrix
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 1 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
The terms (i.e. the vocabulary) can be retrieved with .get_feature_names_out():
['abduct' 'abl' 'abo' 'absente' 'abus' 'academ' 'accept' 'access'
 'accessori' 'accommod']
# Convert the matrix into a Pandas DataFrame for easier manipulation.
DTM = pd.DataFrame(X.toarray(), columns = vec.get_feature_names_out())
print(DTM.head())
  abduct  abl  abo  absente  abus  academ  accept  access  ...  year  yell  yet  york  young  yuan  zimbabw  zykera
0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0
3 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
[5 rows x 1921 columns]
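The per-term totals shown further below (e.g. abduct 1, abus 2) are just the column sums of the DTM. A sketch with a toy stand-in for the real DataFrame:

```python
import pandas as pd

# Toy stand-in for the DTM DataFrame built above (hypothetical counts).
DTM = pd.DataFrame({"abduct": [1, 0], "abus": [1, 1], "said": [3, 2]})

# Total count of each term across all documents: the column sums.
word_counts = DTM.sum(axis=0)
print(word_counts)
```

On the real DTM, `DTM.sum(axis=0)` returns a pandas Series indexed by term, which is what the output below was printed from.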
# Create a convenience function that sorts a dictionary and looks at
# its first n entries, using a lambda within the function.
def HeadDict(dict_x, n):
    # Get items from the dictionary and sort them by
    # value in descending (i.e. reverse) order.
    sorted_x = sorted(dict_x.items(),
                      reverse = True,
                      key = lambda kv: kv[1])
    # Return only the first `n` entries, converted back to a dictionary.
    return dict(sorted_x[:n])
abduct     1
abl 1
abo 1
absente 1
abus 2
dtype: int64
{'said': 38, 'new': 36, 'presid': 28, 'year': 27, 'friday': 22, 'govern': 22}
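As a quick self-contained check, HeadDict can be exercised on a small hand-made dictionary (the counts here are toy values, not the real corpus totals):

```python
# Toy word-count dictionary (hypothetical values for illustration).
counts = {"said": 38, "new": 36, "presid": 28, "year": 27, "abl": 1}

def HeadDict(dict_x, n):
    # Sort items by value in descending order.
    sorted_x = sorted(dict_x.items(), reverse=True, key=lambda kv: kv[1])
    # Keep only the first n entries, as a dictionary.
    return dict(sorted_x[:n])

print(HeadDict(counts, 3))  # {'said': 38, 'new': 36, 'presid': 28}
```

Because Python dictionaries preserve insertion order, the returned dictionary keeps the descending-count order produced by `sorted`.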
| What we need | What we have learned |
|---|---|
| A corpus of documents cleaned and processed in a certain way | ✔ |
| A Document-Term Matrix (DTM): with counts of each word recorded for each document | ✔ |
| A transformed representation of a Document-Term Matrix (i.e. weighted with TF-IDF weights) | |
| Objective | Complete |
|---|---|
| Create Term-Document Matrix | ✔ |
| Explore the distribution of words in corpus | |
An n-gram is a sequence of n words that occur together in a sentence.
This concept is used extensively in the field of Natural Language Processing to build n-gram based models, which have various use cases.
We're not going to dig deeper into these models in this course, but let's learn how to create n-grams using the nltk package!
An n-gram with n = 2 is called a bigram, and one with n = 3 is called a trigram.
To create an n-gram, we will use ngrams from the nltk library's util module. You can find the complete documentation for the ngrams function here.
For example, take the first document's text:
nick kyrgio start brisban open titl defens battl victori american ryan harrison open round tuesday
[('nick', 'kyrgio'), ('kyrgio', 'start'), ('start', 'brisban'), ('brisban', 'open'), ('open', 'titl'), ('titl', 'defens'), ('defens', 'battl'), ('battl', 'victori'), ('victori', 'american'), ('american', 'ryan'), ('ryan', 'harrison'), ('harrison', 'open'), ('open', 'round'), ('round', 'tuesday')]
[('nick', 'kyrgio', 'start'), ('kyrgio', 'start', 'brisban'), ('start', 'brisban', 'open'), ('brisban', 'open', 'titl'), ('open', 'titl', 'defens'), ('titl', 'defens', 'battl'), ('defens', 'battl', 'victori'), ('battl', 'victori', 'american'), ('victori', 'american', 'ryan'), ('american', 'ryan', 'harrison'), ('ryan', 'harrison', 'open'), ('harrison', 'open', 'round'), ('open', 'round', 'tuesday')]
from nltk.util import ngrams

def generate_ngrams(df_clean_list):
    for i in range(len(df_clean_list)):
        # Generate bigrams (n = 2) and trigrams (n = 3) for each document.
        for n in range(2, 4):
            n_grams = ngrams(df_clean_list[i].split(), n)
            for grams in n_grams:
                print(grams)
('nick', 'kyrgio')
('kyrgio', 'start')
('start', 'brisban')
('brisban', 'open')
('open', 'titl')
('titl', 'defens')
('defens', 'battl')
('battl', 'victori')
('victori', 'american')
('american', 'ryan')
('ryan', 'harrison')
('harrison', 'open')
('open', 'round')
('round', 'tuesday')
('nick', 'kyrgio', 'start')
('kyrgio', 'start', 'brisban')
('start', 'brisban', 'open')
('brisban', 'open', 'titl')
('open', 'titl', 'defens')
('titl', 'defens', 'battl')
('defens', 'battl', 'victori')
('battl', 'victori', 'american')
('victori', 'american', 'ryan')
('american', 'ryan', 'harrison')
('ryan', 'harrison', 'open')
('harrison', 'open', 'round')
('open', 'round', 'tuesday')
('british', 'polic')
('polic', 'confirm')
('confirm', 'tuesday')
('tuesday', 'treat')
('treat', 'stab')
('stab', 'attack')
('attack', 'injur')
('injur', 'three')
('three', 'peopl')
('peopl', 'manchest')
('manchest', 'victoria')
('victoria', 'train')
('train', 'station')
('station', 'terrorist')
('terrorist', 'investig')
('investig', 'search')
('search', 'address')
('address', 'cheetham')
('cheetham', 'hill')
('hill', 'area')
('area', 'citi')
('british', 'polic', 'confirm')
('polic', 'confirm', 'tuesday')
('confirm', 'tuesday', 'treat')
('tuesday', 'treat', 'stab')
('treat', 'stab', 'attack')
('stab', 'attack', 'injur')
('attack', 'injur', 'three')
('injur', 'three', 'peopl')
('three', 'peopl', 'manchest')
('peopl', 'manchest', 'victoria')
('manchest', 'victoria', 'train')
('victoria', 'train', 'station')
('train', 'station', 'terrorist')
('station', 'terrorist', 'investig')
('terrorist', 'investig', 'search')
('investig', 'search', 'address')
('search', 'address', 'cheetham')
('address', 'cheetham', 'hill')
('cheetham', 'hill', 'area')
('hill', 'area', 'citi')
('marcellu', 'wiley')
('wiley', 'still')
('still', 'fenc')
('fenc', 'let')
('let', 'young')
('young', 'son')
('son', 'play')
('play', 'footbal')
('footbal', 'former')
('former', 'nfl')
('nfl', 'defens')
('defens', 'end')
('end', 'fox')
('fox', 'sport')
('sport', 'person')
('person', 'tell')
('tell', 'podcaston')
('podcaston', 'sport')
('sport', 'like')
('like', 'nfl')
('nfl', 'tri')
('tri', 'make')
('make', 'footbal')
('footbal', 'safer')
('safer', 'game')
('game', 'de')
('marcellu', 'wiley', 'still')
('wiley', 'still', 'fenc')
('still', 'fenc', 'let')
('fenc', 'let', 'young')
('let', 'young', 'son')
('young', 'son', 'play')
('son', 'play', 'footbal')
('play', 'footbal', 'former')
('footbal', 'former', 'nfl')
('former', 'nfl', 'defens')
('nfl', 'defens', 'end')
('defens', 'end', 'fox')
('end', 'fox', 'sport')
('fox', 'sport', 'person')
('sport', 'person', 'tell')
('person', 'tell', 'podcaston')
('tell', 'podcaston', 'sport')
('podcaston', 'sport', 'like')
('sport', 'like', 'nfl')
('like', 'nfl', 'tri')
('nfl', 'tri', 'make')
('tri', 'make', 'footbal')
('make', 'footbal', 'safer')
('footbal', 'safer', 'game')
('safer', 'game', 'de')
('still', 'reckon')
('reckon', 'fallout')
('fallout', 'emmett')
('emmett', 'till')
('till', 'paint')
('paint', 'chasten')
('chasten', 'artist')
('artist', 'reveal')
('reveal', 'controversi')
('controversi', 'chang')
('chang', 'even')
('even', 'move')
('move', 'forward')
('forward', 'new')
('new', 'galleri')
('galleri', 'show')
('still', 'reckon', 'fallout')
('reckon', 'fallout', 'emmett')
('fallout', 'emmett', 'till')
('emmett', 'till', 'paint')
('till', 'paint', 'chasten')
('paint', 'chasten', 'artist')
('chasten', 'artist', 'reveal')
('artist', 'reveal', 'controversi')
('reveal', 'controversi', 'chang')
('controversi', 'chang', 'even')
('chang', 'even', 'move')
('even', 'move', 'forward')
('move', 'forward', 'new')
('forward', 'new', 'galleri')
('new', 'galleri', 'show')
('far', 'arik')
('arik', 'ogunbowal')
('ogunbowal', 'coach')
('coach', 'muffet')
('muffet', 'mcgraw')
('mcgraw', 'concern')
('concern', 'notr')
('notr', 'dame')
('dame', 'victori')
('victori', 'louisvil')
('louisvil', 'thursday')
('thursday', 'night')
('night', 'anoth')
('anoth', 'atlant')
('atlant', 'coast')
('coast', 'confer')
('confer', 'game')
('game', 'januari')
('far', 'arik', 'ogunbowal')
('arik', 'ogunbowal', 'coach')
('ogunbowal', 'coach', 'muffet')
('coach', 'muffet', 'mcgraw')
('muffet', 'mcgraw', 'concern')
('mcgraw', 'concern', 'notr')
('concern', 'notr', 'dame')
('notr', 'dame', 'victori')
('dame', 'victori', 'louisvil')
('victori', 'louisvil', 'thursday')
('louisvil', 'thursday', 'night')
('thursday', 'night', 'anoth')
('night', 'anoth', 'atlant')
('anoth', 'atlant', 'coast')
('atlant', 'coast', 'confer')
('coast', 'confer', 'game')
('confer', 'game', 'januari')
('prohibit', 'vacat')
('vacat', 'rental')
('rental', 'arrang')
('arrang', 'onlin')
('onlin', 'airbnb')
('airbnb', 'move')
('move', 'closer')
('closer', 'realiti')
('realiti', 'thursday')
('thursday', 'new')
('new', 'orlean')
('prohibit', 'vacat', 'rental')
('vacat', 'rental', 'arrang')
('rental', 'arrang', 'onlin')
('arrang', 'onlin', 'airbnb')
('onlin', 'airbnb', 'move')
('airbnb', 'move', 'closer')
('move', 'closer', 'realiti')
('closer', 'realiti', 'thursday')
('realiti', 'thursday', 'new')
('thursday', 'new', 'orlean')
('contamin', 'food')
('food', 'smell')
('smell', 'like')
('like', 'freedom')
('contamin', 'food', 'smell')
('food', 'smell', 'like')
('smell', 'like', 'freedom')
('end', 'sight')
('sight', 'partial')
('partial', 'feder')
('feder', 'shutdown')
('shutdown', 'distress')
('distress', 'feder')
('feder', 'worker')
('worker', 'paycheck')
('paycheck', 'sight')
('sight', 'either')
('end', 'sight', 'partial')
('sight', 'partial', 'feder')
('partial', 'feder', 'shutdown')
('feder', 'shutdown', 'distress')
('shutdown', 'distress', 'feder')
('distress', 'feder', 'worker')
('feder', 'worker', 'paycheck')
('worker', 'paycheck', 'sight')
('paycheck', 'sight', 'either')
('bottleneck', 'offload')
('offload', 'import')
('import', 'fuel')
('fuel', 'form')
('form', 'mexican')
('mexican', 'oil')
('oil', 'port')
('port', 'follow')
('follow', 'govern')
('govern', 'order')
('order', 'shut')
('shut', 'pipelin')
('pipelin', 'limit')
('limit', 'loss')
('loss', 'widespread')
('widespread', 'fuel')
('fuel', 'theft')
('theft', 'accord')
('accord', 'trader')
('trader', 'refinitiv')
('refinitiv', 'eikon')
('eikon', 'data')
('bottleneck', 'offload', 'import')
('offload', 'import', 'fuel')
('import', 'fuel', 'form')
('fuel', 'form', 'mexican')
('form', 'mexican', 'oil')
('mexican', 'oil', 'port')
('oil', 'port', 'follow')
('port', 'follow', 'govern')
('follow', 'govern', 'order')
('govern', 'order', 'shut')
('order', 'shut', 'pipelin')
('shut', 'pipelin', 'limit')
('pipelin', 'limit', 'loss')
('limit', 'loss', 'widespread')
('loss', 'widespread', 'fuel')
('widespread', 'fuel', 'theft')
('fuel', 'theft', 'accord')
('theft', 'accord', 'trader')
('accord', 'trader', 'refinitiv')
('trader', 'refinitiv', 'eikon')
('refinitiv', 'eikon', 'data')
('follow', 'reaction')
('reaction', 'andi')
('andi', 'murray')
('murray', 'announc')
('announc', 'friday')
('friday', 'year')
('year', 'australian')
('australian', 'open')
('open', 'could')
('could', 'last')
('last', 'tournament')
('tournament', 'profession')
('follow', 'reaction', 'andi')
('reaction', 'andi', 'murray')
('andi', 'murray', 'announc')
('murray', 'announc', 'friday')
('announc', 'friday', 'year')
('friday', 'year', 'australian')
('year', 'australian', 'open')
('australian', 'open', 'could')
('open', 'could', 'last')
('could', 'last', 'tournament')
('last', 'tournament', 'profession')
You are now ready to try Tasks 7-9 in the Exercise for this topic.
| Objective | Complete |
|---|---|
| Create Term-Document Matrix | ✔ |
| Explore the distribution of words in corpus | ✔ |
In this part of the course, we have covered: